Conversation

@iximeow iximeow commented Jul 30, 2025

this materializes RFD 314 and in some respects, 505.

builds on #8725 for CPU family information, which is a stand-in for the notion of sled families and generations described in RFD 314. There are a few important details here where CPU platforms differ from the sled CPU family, and where I've diverged from 314/505 (and need to update the RFDs to match). I hadn't noticed the sheer volume of comments on https://github.com/oxidecomputer/rfd/pull/515, so I'm taking a pass through those and the exact bits in MILAN_CPUID may be further tweaked. I suspect the fixed array needs at least a few more tweaks anyway; cross-referencing RFD 314 turns out to make for awkward review, and it's hard to eyeball the semantics of bits here (or which are to be filled in by some later component of the stack!)

As-is: I think this would be OK to merge, but it's not quite as polished as I'd like it to be, so it's a real PR but I expect further changes.

hardware CPU families are less linear than Oxide CPU platforms.

We can (and do, in RFD 314) define Milan restrictively enough that we can present Turin (and probably later!) CPUs to guests "as if" they were Milan. Similarly, I'd expect that Turin would be defined as roughly "Milan-plus-some-AVX-512-features" and be pretty forward-compatible. Importantly, these are related to but not directly representative of real CPUs; as an example, I'd expect "Turin"-the-instance-CPU-platform to be able to run on a Turin Dense CPU. Conversely, there's probably not a reason to define a "Turin Dense" CPU platform, since from a guest perspective they'd look about the same.

But at the same time, the lineage through the AMD server part family kind of splits at Zen 4, with Zen 4- vs Zen 4c-based parts, and similarly with Zen 5/5c. It's somewhat hard (I think) to predict which workloads would be sensitive to this. And as #8730 gets into a bit, the details of a processor's packaging (core topology, frequency, cache size) can vary substantially even within one CPU family. The important detail here is that we do not expect CPU platforms to cover these details, and it would probably be cumbersome to try; if the instance's constraint is "I want AVX256, and I want to be on high-frequency-capable processors only", then it doesn't actually matter whether it runs on a Turin or a Milan, and tying it to that CPU platform may be overly restrictive.

On instance CPU platforms, the hope is that by focusing on CPU features we're able to present a more linear path as the microarchitectures grow.

instance platforms aren't "minimum"

I've walked back the initial description of an instance's CPU platform as the "minimum CPU platform". As presented in other systems, "minimum CPU platform" would more analogously mean "can we put you on a Rome Gimlet, or must we put you on a Milan Gimlet?", or "Genoa Cosmo vs Turin Cosmo?" - it doesn't seem possible to say "this instance must have AVX 512, but otherwise I don't care what kind of hardware it runs on", yet that's closer to what we mean by CPU platform.

In a "minimum CPU platform" interpretation, we could provide a bunch of Turin CPUID bits to a VM that said it wanted Milan. But since there's no upper bound here, if an OS has an issue with a future "Zen 14" or whatever, a user would discover that by their "minimum-Milan" instance getting scheduled on the new space-age processor and exploding on boot or something. OSes shouldn't do that, but...

Implementation-wise, this is really just about the names right now. You always get Milan CPUID leaves for the time being. When there are Turin CPUID leaves defined for the instance CPU platform, and Cosmos on which they make sense, this becomes more concrete.

"are these CPU platforms compatible?"

RFD 314 deserves a section outlining how we determine that a new CPU can be a stand-in for an older platform, covering this and other leaves (particularly if we pass topology information through, elsewhere). I've added a few words about the problem of one CPUID profile being reasonable to present on hardware with a different CPUID profile, and I've also outlined what I've spotted as potential incompatibilities in the currently-dead functionally_same(). This is one of those things that will have the shape of "very boring and straightforward for eight years, and then a bit will change (or be defined) that's kind of annoying".

RFD 505 proposes that instances should be able to set a "minimum
hardware platform" or "minimum CPU platform" that allows users to
constrain an instance to run on sleds that have a specific set of CPU
features available. This allows a user to opt a VM into advanced
hardware features (e.g. AVX-512 support) by constraining it to run only
on sleds that support those features. For this to work, Nexus needs to
understand what CPUs are present in which sleds.

Have sled-agent query CPUID to get CPU vendor and family information and
report this to Nexus as part of the sled hardware manifest.
@iximeow iximeow force-pushed the minimum-cpu-platforms branch from fb1e0a7 to 620d7c9 on July 30, 2025 01:01
iximeow added 11 commits July 30, 2025 19:03
the existing plumbing was sufficient for sled-agent to report the CPU
family at startup, but did not provide the CPU family when Nexus calls
later for inventory collections. when you've upgraded to this version,
the database migration sets the sled CPU family to `unknown` expecting
that the next inventory collection will figure things out. this doesn't
happen, and the initial check-in doesn't update the CPU type either
(presumably because the sled is already known and initialized from the
control plane's perspective?)

this does... most of the plumbing to report a sled's CPU family for
inventory collection, but it doesn't actually work. `SledCpuFamily`
being both in `omicron-common` and `nexus-client` is kind of unworkable.
probably need a `ConvertInto` or something to transform the shared type
into the `nexus-client` one when needed? i've been trying to figure out
what exactly is necessary and what is just building a mess for myself
for two hours, and this feels like it's going nowhere.
Member Author

@iximeow iximeow left a comment

finally got to taking a fine-tooth comb through the CPUID bits here and the differences between hardware and what guests currently see. for the most part this is in line with what guests already get from bhyve defaults, but i've noticed a few typos that unsurprisingly do not pose an issue booting at least Alpine guests. i'll clean that up and update 314 appropriately tomorrow.

// See [RFD 314](https://314.rfd.oxide.computer/) section 6 for all the
// gnarly details.
const MILAN_CPUID: [CpuidEntry; 32] = [
cpuid_leaf!(0x0, 0x0000000D, 0x68747541, 0x444D4163, 0x69746E65),
Member Author

a guest currently sees eax=0x10 here, where leaves 0xe, 0xf, and 0x10 are all zeroes. 0xe is reserved and zero on the host. 0xf and 0x10 are zeroed because of the default bhyve masking behavior. setting eax=0xd is just more precise: leaves 0xf and 0x10 being "present" but zero does not communicate any feature presence.

Member

i know we reference RFD 314, but it kinda seems like it would be nice to have the comment explaining this here?

Member Author

i went a bit hard on the comment and now there's like 600 more lines of more-a-code-than-a-comment (https://github.com/oxidecomputer/omicron/pull/8728/files#diff-0c754f739cd972f46b539d2a2e5a6220cd0b72e8c9fe7f20a2013fab3f28aa21R167)

0xB, 0x0, 0x00000001, 0x00000002, 0x00000100, 0x00000000
),
cpuid_subleaf!(
0xB, 0x1, 0x00000000, 0x00000000, 0x00000201, 0x00000000
Member Author

this is the same as RFD 314 says, but eax ought to be different in the RFD. Propolis fixes up B.1.EBX here to the number of vCPUs, but an eax of 0 implies that the subleaf is actually invalid (from the APM: "If this function is executed with an unimplemented level (passed in ECX), the instruction returns all zeroes in the EAX register." .. also, taken faithfully, this implies the VM's topology is vCPU-many sockets with SMT pairs across each socket pair. oops).

Member

we should probably update the RFD to reflect that? also lol Oxide is the ONLY platform that lets you make a 64-socket VM...

Member Author

well, it turns out the RFD is fine here, there's a later sentence that says "Index 1's eax and ebx depend on the VM shape". But Propolis is wrong here: oxidecomputer/propolis#939

So I'm going to hide this leaf in the proposed Milan v1 platform, fix the Propolis bug, and we can expose leaf Bh in a v2.

iximeow and others added 9 commits August 6, 2025 20:34
RFD 505 proposes that instances should be able to set a "minimum
hardware platform" or "minimum CPU platform" that allows users to
constrain an instance to run on sleds that have a specific set of CPU
features available. Previously, actually-available CPU information was
plumbed from sleds to Nexus. This actually adds a `min_cpu_platform`
setting for instance creation and uses it to drive selection of guest
CPUID leaves.

As-is, this moves VMs on Gimlets away from the bhyve-default CPUID
leaves (which are effectively "host CPUID information, but features that
are not or cannot be virtualized are masked out"), instead using the
specific CPUID information set out in RFD 505.

There is no provision for Turin yet, which instead gets CPUID leaves
that look like Milan. Adding a set of CPUID information to advertise for
an `amd_turin` CPU platform, from here, is fairly straightforward.

This does not have a mechanism to enforce specific CPU platform use or
disuse, either in a silo or rack-wide. One could imagine a simple system
oriented around "this silo is permitted to specify these minimum CPU
platforms", but that leaves uncomfortable issues like: if a silo A
permits only Milan, and silo B permits Milan and Turin, all Milan CPUs
are allocated already, and someone is attempting to create a new
Milan-based VM in silo A, should this succeed using Turin CPUs,
potentially starving silo B?
@iximeow iximeow force-pushed the minimum-cpu-platforms branch from 620d7c9 to 5cf7b9c on August 7, 2025 18:20
@iximeow iximeow marked this pull request as ready for review August 7, 2025 19:54
@@ -612,6 +613,7 @@ mod tests {
external_ips: vec![],
disks: vec![],
boot_disk: None,
cpu_platform: None,
Member Author

sure is weird that this file doesn't include a change to SledReservationConstraintBuilder like in instance_start.rs, huh?

that's a bug. this currently would allow a Turin to migrate onto a Milan (bad)

Base automatically changed from gjc/report-sled-families to main August 12, 2025 23:16
Member

@hawkw hawkw left a comment

Did a cursory initial review. Beyond the things you've already identified, this change seems like it's going in the right direction.

Comment on lines +1290 to +1291
/// hypervisor support may be masked off. See RFD 314 for specific CPU features
/// in a CPU platform.
Member

Is RFD 314 public, and do we intend to make it public? If we're referencing it in a comment that will eventually make it into the public API docs, perhaps we should?

Comment on lines +1319 to +1322
* If this is NULL, the control plane ignores CPU constraints when selecting
* a sled for this instance. Then, once it has selected a sled, it supplies
* a "lowest common denominator" CPU platform that is compatible with that
* sled to maximize the number of sleds the VM can migrate to.
Member

turbo nit, sorry: it might be nice to have a similar comment in the db model Instance type --- when reading the type definition, i wasn't sure why this was null, so had to go check dbinit for it

Comment on lines 31 to 32
KnownVersion::new(175, "add-instance-cpu-platform"),
KnownVersion::new(174, "sled-cpu-family"),
Member

take it or leave it, but i think it would've been fine to do this in a single migration --- though maybe separating them is nicer since they change different tables...

Member Author

this is actually a weird github thing.. as a draft this was on top of another PR that added sled CPU family reporting, which is merged, and i think was migration 174! i thought this would evaporate out of the diff. i'll have to take a closer look at what's going on with that..

Member Author

ah ok, so when landing the sled CPU family PR i had to bump the schema number a bit so as to actually make it the latest, and i did not rebase this PR on the commit that finally went in. i've merged main and this whole diff is much more reasonable now.

Comment on lines +50 to +54
// VMMs get the "sled default" CPU platform when an instance starts
// up on a sled that hasn't reported a well-known CPU family. Assume
// that nothing is known about the VM's compatible CPU platforms in
// this case.
Self::SledDefault => None,
Member

this...feels like some combination of reasonable and sketchy. i kinda wonder if we ought to just never schedule instances on sleds that haven't reported a well-known CPU family? i want to not have a situation where a VM experiences a weird blip in virtual platform due to getting scheduled on a Mystery Sled..

Member Author

yeah.. this is kind of a provision to make test environments not require Exactly A Zen 3 Or Zen 5 Or Carry Your Own Patches - to get here you have to have been scheduled on a SledCpuFamily::Unknown, and for that to have happened either we've shipped a sled-agent that doesn't know how to identify a sled's CPU anymore, or we've shipped a sled that sled-agent doesn't know about and placed instances on it before updating it. (even if you set the instance CPU platform to null through the API you end up at the minimum compatible platform for the selected sled, so there's no way to get here from just API actions)

the good/bad news is that in this case you won't be able to migrate the instance anyway, so it's a "blip" in that the instance will just be this way until you stop and start it again..

",
);

// TODO(gjc): eww. the correct way to do this is to write this as
Member

//TODO(gjc):

😢

Member Author

Greg would be glad to know that the way to spell this that Diesel and Cockroach are happy about is fully baffling: 75ed627

Comment on lines +226 to +235
// If the instance didn't supply a CPU platform, select one for this VMM by
// looking up the chosen sled and selecting the "minimum compatible
// platform" for sleds of that lineage. This maximizes the number of sleds
// that can host the VMM if it needs to migrate in the future. Selecting the
// sled first and then deriving the platform is meant to support
// heterogeneous deployments: if a deployment contains some sleds with CPUs
// from vendor A, and some with CPUs from vendor B, then selecting the sled
// first implicitly chooses a vendor, and then the "minimum compatible"
// computation selects the most compatible platform that can run on sleds
// with CPUs from that vendor.
Member

I initially felt a bit uncomfy about leaving it unbound until the instance is started (mainly just because I was trying to think my way out of having the db column be nullable), but I think that not picking something until the instance is started is the right move, so that it can be based on the sled that ends up hosting the instance, and we can select that sled in a less constrain-y way.

Comment on lines 1267 to 1268
/// The CPU platform to be used for this instance. If this is `null`, the
/// instance requires no particular CPU platform.
Member

Since this comment generates API docs, perhaps we should discuss what happens if you set it to null here? I would want to know that "the instance requires no particular CPU platform" also means "...and will be implicitly bound to the CPU platform of whatever the first sled to host it is", rather than meaning "we will always let you schedule that instance anywhere and if the guest is confused by this it's your fault"

Member Author

"we will always let you schedule that instance anywhere and if the guest is confused by this it's your fault"

that is actually what it means lol, so yes definitely this deserves a note. the platform that's bound is on the VMM, specifically so that we can't migrate you to a different kind of system once your instance is running. but if you stop it and start it again without an explicit CPU platform you can go anywhere.


impl std::fmt::Display for SledCpuFamily {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
write!(f, "{}", self.as_str())
Member

turbo nit: slightly simpler to just do

Suggested change
write!(f, "{}", self.as_str())
f.write_str(self.as_str())

* leaf 1 should *not* have x2APIC (yet)
* leaf 7 EDX bit 4 indicates FSRM, should be set
also rustfmt, resolve warnings, and fill in explicit zero leaves so that
the RFD 314-defined CPUID table matches what the builder produces now.
yay!!
Member Author

@iximeow iximeow left a comment

frankly i did not realize how nice being able to write a raw_cpuid::CpuId would be, and how much more legible that makes this PR (and RFD 314, and future CPU platforms, and now we can pretty trivially make Propolis validate the CPUID bits it's told, and ...)

}

#[test]
fn milan_rfd314_is_as_described() {
Member Author

the other cool thing we can do now is grab a CPUID dump from a guest before using explicit definitions and include a test describing how it and the RFD 314 definitions differ. i want to include that alongside this test before merging (would have been so much nicer than the manual comparison i did before...)

short history here: this bit is set in hardware, the extended space is
not supported by bhyve, but the bit was passed through. hiding this bit
"should" be fine, but out of caution we're not doing that in the first
statically-defined CPU platform. this bit was not set in the MILAN_CPUID
blob, though, even though the RFD was updated.
these tests are:
* the `milan_rfd314` function computes something that matches RFD 314
* the RFD 314 definition differs in specific and justified ways from
  pre-314 guest CPUID profiles.
Member Author

iximeow commented Sep 3, 2025

the new #[test] fn milan_current_vs_rfd314_is_understood() describes the differences between a guest's CPUID profile today and what's in this PR in a more durable (and imo reasonable!) way than a bunch of in-line comments on this PR. i've resolved all the "a guest currently sees" comments on the RFD 314 Milan in favor of that, and in a few cases, comments there.

For a reason I can't figure out, when this was
```
AND sled.cpu_family IN (...)
```
binding a parameter made that query end up like
```
AND sled.cpu_family IN ($1,)
```
which then expects a single `sled_cpu_family` enum for later binding.
Binding an array then yielded a type error like
```
DatabaseError(Unknown, "invalid cast: sled_cpu_family[] -> sled_cpu_family")
```

... but writing it like `AND sled.cpu_family = ANY (...)` does not get
the extra comma and instead renders like `ANY ($1)` which happily takes
a `sled_cpu_family[]`.